ARTICLE 



doi:10.1038/nature13772



Synaptic, transcriptional and chromatin 
genes disrupted in autism 

A list of authors and their affiliations appears at the end of the paper 

The genetic architecture of autism spectrum disorder involves the interplay of common and rare variants and their 
impact on hundreds of genes. Using exome sequencing, here we show that analysis of rare coding variation in 3,871 
autism cases and 9,937 ancestry-matched or parental controls implicates 22 autosomal genes at a false discovery rate 
(FDR) < 0.05, plus a set of 107 autosomal genes strongly enriched for those likely to affect risk (FDR < 0.30). These 107
genes, which show unusual evolutionary constraint against mutations, incur de novo loss-of-function mutations in over
5% of autistic subjects. Many of the genes implicated encode proteins for synaptic formation, transcriptional regulation
and chromatin-remodelling pathways. These include voltage-gated ion channels regulating the propagation of action
potentials, pacemaking and excitability-transcription coupling, as well as histone-modifying enzymes and chromatin
remodellers, most prominently those that mediate post-translational lysine methylation/demethylation modifications
of histones. 



Features of subjects with autism spectrum disorder (ASD) include
compromised social communication and interaction. Because the bulk of
risk arises from de novo and inherited genetic variation, characterizing
which genes are involved informs ASD neurobiology and reveals
part of what makes us social beings. 

Whole-exome sequencing (WES) studies have proved fruitful in uncovering
risk-conferring variation, especially by enumerating de novo variation,
which is sufficiently rare that recurrent mutations in a gene provide
strong evidence for a causal link to ASD. De novo loss-of-function (LoF)
single-nucleotide variants (SNVs) or insertion/deletion (indel) variants
are found in 6.7% more ASD subjects than in matched controls and
implicate nine genes from the first 1,000 ASD subjects analysed.
Moreover, because there are hundreds of genes involved in ASD risk,
ongoing WES studies should identify additional ASD genes as an almost
linear function of increasing sample size.

Here we conduct the largest ASD WES study so far, analysing 16 sample
sets comprising 15,480 DNA samples (Supplementary Table 1 and
Extended Data Fig. 1). Unlike earlier WES studies, we do not rely solely
on counting de novo LoF variants; rather, we use novel statistical methods
to assess association for autosomal genes by integrating de novo,
inherited and case-control LoF counts, as well as de novo missense variants
predicted to be damaging. For many samples, original data from sequencing
performed on Illumina HiSeq 2000 systems were used to call SNVs
and indels in a single large batch using GATK (v2.6). De novo mutations
were called using enhancements of earlier methods (Supplementary
Information), with calls validating at extremely high rates.

After evaluation of data quality, high-quality alternative alleles with 
a frequency of <0.1% were identified, restricted to LoF (frameshifts, 
stop gains, donor/acceptor splice site mutations) or probably damaging 
missense (Mis3) variants (defined by PolyPhen-2 (ref 18)). Variants were 
classified by type (de novo, case, control, transmitted, non-transmitted) 
and severity (LoF, Mis3), and counts tallied for each gene. 
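
The filtering and tallying step described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the `Variant` record, its field names and the example values are hypothetical, and only the 0.1% frequency cutoff and the type and severity labels come from the text.

```python
from collections import Counter
from dataclasses import dataclass

MAX_FREQ = 0.001  # alternative-allele frequency cutoff of 0.1%, as stated in the text
SEVERITIES = {"LoF", "Mis3"}
ORIGINS = {"de novo", "case", "control", "transmitted", "non-transmitted"}


@dataclass(frozen=True)
class Variant:
    # Hypothetical record layout, for illustration only.
    gene: str
    freq: float     # alternative-allele frequency
    severity: str   # "LoF", "Mis3", or another label (filtered out)
    origin: str     # one of ORIGINS


def tally(variants):
    """Count rare LoF/Mis3 variants per (gene, origin, severity)."""
    counts = Counter()
    for v in variants:
        if v.freq < MAX_FREQ and v.severity in SEVERITIES and v.origin in ORIGINS:
            counts[(v.gene, v.origin, v.severity)] += 1
    return counts


# Made-up example records.
example = [
    Variant("GENE_A", 0.0, "LoF", "de novo"),
    Variant("GENE_A", 0.0002, "LoF", "transmitted"),
    Variant("GENE_A", 0.02, "LoF", "case"),            # too common, dropped
    Variant("GENE_B", 0.0, "synonymous", "de novo"),   # wrong severity, dropped
    Variant("GENE_B", 0.0005, "Mis3", "de novo"),
]
counts = tally(example)
```

Keying the counter on the (gene, origin, severity) triple keeps every class of evidence separate, which is what a downstream gene-level model would need.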

Some 13.8% of the 2,270 ASD trios (two parents and one affected
child) carried a de novo LoF mutation, significantly in excess of both
the expected value (8.6%, P < 10~^^) and what was observed in 510
control trios (7.1%, P = 1.6 × 10~^) collected here and previously
published. Eighteen genes (Table 1) exhibited two or more de novo LoF
mutations. These genes are all known or strong candidate ASD genes,
but given the number of trios sequenced and gene mutability, we
would expect to observe this in approximately two such genes by chance.
While we expect only two de novo Mis3 events in these 18 genes, we
observe 16 (P = 9.2 × 10~^\, Poisson test). Because most of our data
exist in cases and controls and because we observed an additional excess
of transmitted LoF events in the 18 genes, it is evident that the optimal
analytical framework must involve an integration of de novo mutation
with variants observed in cases and controls and transmitted or
untransmitted from carrier parents. Investigating beyond de novo LoFs is also
critical given that many ASD risk genes and loci have mutations that
are not completely penetrant.
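
The Poisson comparison of observed versus expected de novo Mis3 counts mentioned above can be illustrated with a short upper-tail calculation. This is a sketch under an assumption: the expected count is taken as exactly 2 (the text gives it only as "two"), so the result shows the order of magnitude of such a test and is not meant to reproduce the reported P value.

```python
import math


def poisson_upper_tail(k, lam, terms=200):
    """P(X >= k) for X ~ Poisson(lam), summed directly over the upper tail.

    Summing the tail in log space avoids the cancellation error of 1 - CDF
    when the tail probability is very small.
    """
    return sum(
        math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))
        for i in range(k, k + terms)
    )


# Assumed expectation of 2 events; 16 observed, as in the text.
p_value = poisson_upper_tail(16, 2.0)
```

Because the tail probability is extremely sensitive to the expectation, small changes in the assumed mean shift the result by an order of magnitude.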

Transmission and de novo association 

We adopted TADA (transmission and de novo association), a weighted,
statistical model integrating de novo, transmitted and case-control
variation. TADA uses a Bayesian gene-based likelihood model including
per-gene mutation rates, allele frequencies, and relative risks of
particular classes of sequence changes. We modelled both LoF and Mis3 sequence
variants. Because no aggregate association signal was detected for
inherited Mis3 variants, they were not included in the analysis. For each gene,
variants of each class were assigned the same effect on relative risk. Using 
a prior probability distribution of relative risk across genes for each class 
of variants, the model effectively weighted different classes of variants 
in this order: de novo LoF > de novo Mis3 > transmitted LoF, and allowed 
for a distribution of relative risks across genes for each class. The strength 
of association was assimilated across classes to produce a gene-level Bayes 
factor with a corresponding FDR q value. This framework increases the 
power compared to the use of de novo LoF variants alone (Extended 
Data Fig. 2). 
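
The step from a gene-level Bayes factor to an FDR q value can be sketched with a generic Bayesian FDR calculation. This is a minimal illustration, not the TADA implementation: combining classes by multiplying their Bayes factors assumes the classes are independent, and the prior `pi0`, the gene names and the Bayes factor values are all hypothetical.

```python
import math


def combine_bayes_factors(per_class_bfs):
    """Multiply per-class Bayes factors (assumes the classes are independent)."""
    return math.prod(per_class_bfs)


def bayesian_fdr(bayes_factors, pi0):
    """Return one q value per gene from gene-level Bayes factors.

    pi0 is the prior probability that a gene is NOT a risk gene.
    """
    # Posterior probability that each gene is null.
    post_null = [pi0 / (pi0 + (1.0 - pi0) * bf) for bf in bayes_factors]
    order = sorted(range(len(post_null)), key=post_null.__getitem__)
    q_values = [0.0] * len(post_null)
    running = 0.0
    for rank, idx in enumerate(order, start=1):
        running += post_null[idx]
        # Mean posterior null probability among the `rank` best-supported genes.
        q_values[idx] = running / rank
    return q_values


# Hypothetical per-class Bayes factors: de novo LoF, de novo Mis3, transmitted LoF.
gene_bfs = {
    "GENE_A": combine_bayes_factors([50.0, 4.0, 1.5]),
    "GENE_B": combine_bayes_factors([1.0, 2.0, 1.0]),
    "GENE_C": combine_bayes_factors([1.0, 1.0, 0.5]),
}
q = dict(zip(gene_bfs, bayesian_fdr(list(gene_bfs.values()), pi0=0.94)))
```

Under this construction, the q value at a given rank is the expected fraction of false discoveries among all genes ranked at least that highly.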

TADA identified 33 autosomal genes with an FDR < 0.1 (Table 1)
and 107 with an FDR < 0.3 (Supplementary Tables 2 and 3 and Extended
Data Fig. 3). Of the 33 genes, 15 (45.5%) are known ASD risk genes; 11
have been reported previously with mutations in ASD patients but were
not classed as true risk genes owing to insufficient evidence (SUV420H1
(refs 11, 15), ADNP, BCL11A, CACNA2D3 (refs 15, 21), CTTNBP2
(ref. 15), GABRB3 (ref. 21), CDC42BPB, APH1A, NR3C2 (ref. 15),
SETD5 (refs 14, 22) and TRIO) and 7 are completely novel (ASH1L,
MLL3 (also known as KMT2C), ETFB, NAA15, MYO9B, MIB1 and VIL1).
ADNP mutations have recently been identified in 10 patients with ASD
and other shared clinical features. Two of the newly discovered genes,

Table 1 | ASD risk genes

dnLoF count  FDR < 0.01                       0.01 < FDR < 0.05              0.05 < FDR < 0.1
≥2           ADNP, ANK2, ARID1B, CHD8, CUL3,  ASXL3, BCL11A, CACNA2D3, MLL3  ASH1L
             DYRK1A, GRIN2B, KATNAL2, POGZ,
             SCN2A, SUV420H1, SYNGAP1, TBR1
1                                             CTTNBP2, GABRB3, PTEN, RELN    APH1A, CDC42BPB, ETFB, NAA15,
                                                                             MYO9B, MYT1L, NR3C2, SETD5, TRIO
0                                             MIB1                           VIL1

TADA analysis of LoF and damaging missense variants found to be de novo in ASD subjects, inherited by ASD subjects, or present in ASD subjects (versus control subjects). dnLoF, de novo LoF events.



ASH1L and MLL3, converge on chromatin remodelling. MYO9B plays
a key role in dendritic arborization. MIB1 encodes an E3 ubiquitin
ligase critical for neurogenesis and is regulated by miR-137 (ref. 26),
a microRNA that regulates neuronal maturation and is implicated in
schizophrenia risk.

When the WES data from genes with an FDR < 0.3 were evaluated 
for the presence of deletion copy number variants (CNVs) (such CNVs 
are functionally equivalent to LoF mutations), 34 CNVs meeting quality 
and frequency constraints (Supplementary Information) were detected 
in 5,781 samples (Extended Data Fig. 1). Of the 33 genes with an FDR 
< 0.1, 3 contained deletion CNVs mapping to 3 ASD subjects and one 
parent. Of the 74 genes meeting the criterion 0.1 < FDR < 0.3, about 
one-third could be false positives. Deletion CNVs were found in 14 of 
these genes and the data supported risk status for 10 of them (Extended 
Data Table 1 and Extended Data Fig. 4). Two of these ten, NRXN1 and
SHANK3, were previously implicated in ASD. The risk from deletion
CNVs, as measured by the odds ratio, is comparable to that from
LoF SNVs in cases versus controls or transmission of LoF variants 
from parents to offspring. 
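
The odds ratio used here follows the definition given in the footnote to Extended Data Table 1, (ad)/(bc). A minimal sketch of that calculation is shown below; the counts in the example call are hypothetical and are not taken from the study.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio (a*d)/(b*c) for a 2x2 carrier-by-status table.

    a: affected carriers          b: unaffected carriers
    c: affected non-carriers      d: unaffected non-carriers
    """
    if b == 0 or c == 0:
        raise ZeroDivisionError("odds ratio is undefined when b or c is zero")
    return (a * d) / (b * c)


# Hypothetical counts, for illustration only.
example_or = odds_ratio(a=6, b=2, c=994, d=1998)
```

With very small carrier counts, as is typical for rare deletions, the estimate is unstable, which is one reason to compare it against the estimates from LoF SNVs rather than interpret it alone.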

Estimated odds ratios of top genes 

Inherent in our conception of the biology of ASD is the notion that
there is variation between genes in their impact on risk; for a given
class of variants (for example, LoF) some genes have a large impact,
others smaller, and still others have no effect at all. In addition,
misannotation of variants, among other confounds, can yield false variant
calls in subjects (Supplementary Information). These confounds can
often be overcome by examining the data in a manner orthogonal to
gene discovery. For example, females have greatly reduced rates of ASD
relative to males (a 'female protective effect'). Consequently, and
regardless of whether this is diagnostic bias or biological protection, females
have a higher liability threshold, requiring a larger genetic burden before
being diagnosed. A corollary is that if a variant has the same effect
on autism liability in males as it does in females, that variant will be
present at a higher frequency in female ASD cases compared to males.
Importantly, the magnitude of the difference is proportional to risk as
measured by the odds ratio; hence, the effect on risk for a class of variants
can be estimated from the difference in frequency between males and
females.
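
The liability-threshold reasoning above can be illustrated with a small numerical sketch. It is a simplified model under stated assumptions, not the estimator used in the study: liability is standard normal in non-carriers, a rare variant shifts carrier liability by a fixed amount in both sexes, and the male and female prevalences are hypothetical round numbers.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def case_enrichment(prevalence, shift):
    """Frequency of a rare variant in cases relative to the population.

    Individuals whose liability exceeds the threshold implied by
    `prevalence` are affected; carriers have liability shifted by `shift`
    (in standard-deviation units).
    """
    threshold = STD_NORMAL.inv_cdf(1.0 - prevalence)
    p_case_given_carrier = 1.0 - STD_NORMAL.cdf(threshold - shift)
    return p_case_given_carrier / prevalence


def female_to_male_ratio(shift, prev_male=0.02, prev_female=0.005):
    """Variant frequency in female cases divided by that in male cases.

    The prevalences are hypothetical values chosen for illustration.
    """
    return case_enrichment(prev_female, shift) / case_enrichment(prev_male, shift)


ratios = {shift: female_to_male_ratio(shift) for shift in (0.0, 0.5, 1.0, 2.0)}
```

In this toy model a variant with no effect is equally frequent in male and female cases, and the female excess grows with the size of the effect, which is the relationship the text exploits to estimate risk from sex differences in variant frequency.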

Genes with an FDR < 0.1 show profound female enrichment for 
de novo events (P = 0.005 for LoF, P = 0.004 for Mis3), consistent 
with de novo events having large impacts on liability (odds ratio > 20; 
Extended Data Fig. 5). However, genes with an FDR between 0.1 and 
0.3 show substantially less enrichment for female events, consistent 
with a modest impact for LoF variants (odds ratio range 2-4, whether 
transmitted or de novo) and little to no effect from Mis3 variants. The 

Figure 1 | ASD genes in synaptic networks.

a, Enrichment of 107 TADA genes in: FMRP
targets from two independent data sets and
their overlap; RBFOX targets; RBFOX targets with
predicted alterations in splicing; RBFOX1 and
H3K4me3 overlapping targets; genes with de novo
mutations in schizophrenia (SCZ); human
orthologues of Genes2Cognition (G2C) mouse
synaptosome (SYN) or PSD genes; constrained
genes; and genes encoding mitochondrial proteins
(as a control). Red bars indicate empirical P values
(Supplementary Information). b, Synaptic proteins
encoded by TADA genes. c, De novo Mis3
variants in Nav1.2 (SCN2A). The four repeats
(I-IV) with P-loops, the EF-hand, and the IQ
domain are shown, as are the four amino acids
(DEKA) forming the inner ring of the
ion-selectivity filter. d, Variants in Cav1.3 (CACNA1D).
Part of the channel is shown, including helices
one and six (S1 and S6) for domains I-IV, the
NSCaTE motif, the EF-hand domain, the pre-IQ,
IQ, proximal (PCRD) and distal (DCRD)
C-terminal regulatory domains, the proline-rich
region, and the PDZ domain-binding motif.

results are consistent with inheritance patterns: LoF mutations in
FDR < 0.1 genes are rarely inherited from unaffected parents whereas 
those in the 0.1 < FDR < 0.3 group are far more often inherited than 
they are de novo mutations. 

By analysing the distribution of relative risk over inferred ASD
genes, the number of ASD risk genes can be estimated. The estimate
relies on the balance of genes with multiple de novo LoF mutations 
versus those with only one: the larger the number of ASD genes, the 
greater proportion that will show only one de novo LoF. This approach 
yields an estimate of 1,150 ASD genes (Supplementary Information). 
While there are many more genes to be discovered, many will have a 
modest impact on risk compared to the genes in Table 1. 

Enrichment analyses 

Gene sets with an FDR < 0.3 are strongly enriched for genes under
evolutionary constraints (P = 3.0 × 10~^^; Fig. 1a and Supplementary
Table 4), consistent with the hypothesis that heterozygous LoF mutations
in these genes are ASD risk factors. Over 5% of ASD subjects carry
de novo LoF mutations in our FDR < 0.3 list. We also observed that
genes in the FDR < 0.3 list had a significant excess of de novo
non-synonymous events detected by the largest schizophrenia WES study
so far (P = 0.0085; Fig. 1a), providing further evidence for overlapping
risk loci between these disorders and independent confirmation
of the signal in the gene sets presented here.

We found significant enrichment for genes encoding messenger RNAs
targeted by two neuronal RNA-binding proteins: FMRP (also known
as FMR1), mutated or absent in fragile X syndrome (P = 1.20 × 10~^^,
34 targets, of which 11 are corroborated by an independent data set),
and RBFOX (RBFOX1/2/3) (P = 0.0024, 20 targets, of which 12 overlap
with FMRP), with RBFOX1 shown to be a splicing factor dysregulated
in ASD (Fig. 1a). These two pathways expand the complexity of
ASD neurobiology to post-transcriptional events, including splicing
and translation, both of which sculpt the neural proteome.



We found nominal enrichment for human orthologues of mouse genes
encoding synaptic (P = 0.031) and post-synaptic density (PSD) proteins
(P = 0.046; Fig. 1a, b and Supplementary Tables 4-6). Enrichment
analyses for InterPro, SMART or Pfam domains (FDR < 0.05 and a
minimum of five genes per category) reveal an overrepresentation of
DNA- or histone-related domains: eight genes encoding proteins with
InterPro zinc-finger FYVE PHD domains (142 such annotated genes
in the genome; FDR = 7.6 × 10"^), and five with Pfam Su(var)3-9,
enhancer-of-zeste, trithorax (SET) domains (39 annotated in the
genome; FDR = 8.2 × 10"^).
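
A domain enrichment of this kind can be illustrated with a hypergeometric upper-tail test. The sketch below rests on assumptions: the genome background of 18,000 genes is hypothetical (the text does not state it), and the result is an uncorrected P value, whereas the figures reported above are FDR values from the annotation tools used.

```python
from math import comb


def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(X >= k) when `drawn` genes are sampled without replacement from
    `population` genes, of which `annotated` carry the domain."""
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


GENOME_SIZE = 18_000  # assumed background size, not given in the text

# Eight of 107 genes carry a domain annotated on 142 genes genome-wide.
p_phd = hypergeom_upper_tail(8, GENOME_SIZE, annotated=142, drawn=107)
```

Exact integer arithmetic with `math.comb` keeps the calculation accurate even though the individual binomial coefficients are far larger than any floating-point number.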

Integrating complementary data 

To implicate additional genes in risk for ASD, we used a model called
DAWN (detecting association with networks). DAWN invokes a hidden
Markov random field framework to identify clusters of genes that
show strong association signals and highly correlated co-expression in
a key tissue and developmental context. Previous research suggests human
mid-fetal prefrontal and motor-somatosensory neocortex is a critical
nexus for risk; thus we evaluated gene co-expression data from that
tissue together with TADA scores for genes with an FDR < 0.3. Because
this list is enriched for genes under evolutionary constraint, we
generalized DAWN to incorporate constraint scores (Supplementary
Information). When TADA results, gene co-expression in mid-fetal neocortex
and constraint scores are jointly modelled, DAWN identifies 160 genes
that plausibly affect risk (Fig. 2), 91 of which are not in the 107 TADA
genes with an FDR < 0.3. Moreover, the model parameter describing
evolutionary constraint is an important predictor of clusters of putative
risk genes (P = 0.018).

A subnetwork obtained by seeding the 160 DAWN genes within a
high-confidence protein-protein interactome confirmed that the putative
genes are enriched for neuronal functions. We kept the largest connected
component, containing 95 seed DAWN genes, 50 of which were
in the FDR < 0.3 gene set. The DAWN gene products form four natural

Figure 2 | ASD genes in neuronal 
networks. Protein-protein 
interaction network created by 
seeding TADA and DAWN- 
predicted genes. Only intermediate 
genes that are known to interact with 
at least two TADA and/or DAWN 
genes are included. Four natural 
clusters (C1-C4) are demarcated 
with black ellipses. All nodes are 
sized on the basis of degree of 
connectivity. 

clusters on the basis of network connectivity (Fig. 2). We visualized the
enriched pathways and biological functions for each of these clusters
on 'canvases' (Extended Data Fig. 6). Many of the previously known
ASD risk genes fall in cluster C3, including genes involved in synaptic
transmission and cell-cell communication. Cluster C4 is enriched for
genes related to transcriptional and chromatin regulation. Many TADA
and DAWN genes in this cluster interact tightly with other transcription
factors, histone-modifying enzymes and DNA-binding proteins.
Five TADA genes in the cluster C2 are bridged to the rest of the network
through MAPT, as inferred by DAWN. The enrichment results for
cluster C2 indicate that genes implicated in neurodegenerative disorders
could also have a role in neurodevelopmental disorders.

Emergent results 

Amongst the critical synaptic components found to be mutated in our
study are voltage-gated ion channels involved in fundamental processes
including the propagation of action potentials (for example, the Nav1.2
channel), neuronal pacemaking and excitability-transcription coupling
(for example, the Cav1.3 channel) (Fig. 1b). We identified four LoF and
five Mis3 variants in SCN2A (Nav1.2), three Mis3 variants in CACNA1D
(Cav1.3) and two LoF variants in CACNA2D3 (α2δ-3 subunit).
Remarkably, three de novo Mis3 variants in SCN2A affected residues mutated in
homologous genes in patients with other syndromes, including Brugada
syndrome (SCN5A) or epilepsy disorders (SCN1A) (Arg379His and
Arg937His). These arginines, as well as the threonine mutated in Thr1420Met,
cluster to the P-loops forming the ion selectivity filter, located in
proximity to the inner ring (DEKA motif) (Fig. 1c). Because homologous
channels mutated in these arginines do not conduct inward Na+
currents, Arg379His and Arg937His mutations might have a similar effect.

Two de novo CACNA1D variants (Gly407Arg and Ala749Gly) emerged
at positions proximal to residues mutated in patients with primary
aldosteronism and neurological deficits (Fig. 1d). The reported mutations
interfere with channel activation and inactivation. Amongst variants
found in cases, Ala59Val maps to the NSCaTE domain, also important
for Ca2+-dependent inactivation, and Ser1977Leu and Arg2021His
co-cluster in the carboxy-terminal proline-rich domain, the site of
interaction with SHANK3, a key PSD scaffolding protein. Mutations in RIMS1
and RIMBP2, which can associate with Cav1.3, were found in our cohort
(but with an FDR > 0.3).

Chromatin remodelling involves histone-modifying enzymes (encoded
by histone-modifier genes, HMGs) and chromatin remodellers (readers)
that recognize specific histone post-translational modifications and
orchestrate their effects on chromatin. Our gene set is enriched in HMGs
(9 HMGs out of 152 annotated in HIstome, Fisher's exact test, P =
2.2 × 10~^). Enrichment in the gene ontology term 'histone-lysine
N-methyltransferase activity' (5 genes out of 41 so annotated; FDR =
2.2 × 10~^) highlights this as a prominent pathway.

Lysines on histones 3 and 4 can be mono-, di- or tri-methylated,
providing a versatile mechanism for either activation or repression of
transcription. Of 107 TADA genes, five are SET lysine methyltransferases,
four are jumonji lysine demethylases, and two are readers (Fig. 3a).
RBFOX1 co-isolates with histone H3 trimethyl Lys 4 (H3K4me3) and
our data set is enriched in targets shared by RBFOX1 and H3K4me3
(P = 0.0166; Fig. 1a and Supplementary Table 4). Some de novo missense
variants targeting these genes map to functional domains (Extended
Data Fig. 7).

For the H3K4me2 reader CHD8, we extended our analyses in search 
of additional de novo variation in the cases of the case-control sample. 
By sequencing complete parent-child trios for many CHD8 variants, 
five variants were found to be de novo, two of which affect essential 
splice sites and cause LoF by exon skipping or activation of cryptic splice 
sites in lymphoblastoid cells (Fig. 3b). 

Given the role of HMGs in transcription, we reasoned that TADA 
genes might be interconnected through transcription 'routes'. We searched 
for a connected network (seeded by 9 TADA HMGs) in a transcription 
factor interaction network (ChEA). We found that 46 TADA genes

Figure 3 | ASD genes in chromatin remodelling. a, TADA genes cluster to
chromatin-remodelling complexes. Amino-terminals of histones H3, H4
and part of H2A are shown. Lysine methyltransferases add methyl groups,
whereas lysine demethylases remove them. b, De novo Mis3 and LoF variants in
CHD8. The box shows the outcome of reverse transcription PCR and Sanger
sequencing in lymphoblastoid cells for two newly identified de novo splice-site
variants. The first mutation affects an acceptor splice site (red arrow), causing
the activation of a cryptic splice site (red box), a four-nucleotide deletion,
frame shift and a premature stop. The second mutation affects a donor splice
site (red arrow), causing exon skipping, frame shift and a premature stop.

are directly interconnected in a 55-gene cluster (Extended Data Fig. 8) 
(P = 0.002; 1,000 random draws), for a total of 69 when including all 
known HMGs (Fig. 4) (P = 0.001; 1,000 random draws). 
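
The idea of judging network connectivity against random draws can be sketched as an empirical P value. Everything specific here is an assumption made for illustration: the statistic (size of the largest connected component among the selected genes), the toy ring-shaped network, the gene labels and the (hits + 1)/(draws + 1) convention are not taken from the study, which does not detail its procedure in this section.

```python
import random


def largest_component(nodes, adjacency):
    """Size of the largest connected component of the subgraph induced by `nodes`."""
    nodes = set(nodes)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            current = stack.pop()
            size += 1
            for neighbour in adjacency.get(current, ()):
                if neighbour in nodes and neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        best = max(best, size)
    return best


def empirical_p(gene_set, adjacency, draws=1000, seed=0):
    """Fraction of random gene sets at least as connected as `gene_set`."""
    rng = random.Random(seed)
    universe = sorted(adjacency)
    observed = largest_component(gene_set, adjacency)
    hits = sum(
        largest_component(rng.sample(universe, len(gene_set)), adjacency) >= observed
        for _ in range(draws)
    )
    return (hits + 1) / (draws + 1)


# Toy network: 60 hypothetical genes arranged in a ring.
N = 60
toy_network = {f"g{i}": {f"g{(i - 1) % N}", f"g{(i + 1) % N}"} for i in range(N)}
p_connected = empirical_p([f"g{i}" for i in range(6)], toy_network)
```

Adding one to both the numerator and the denominator keeps the estimate away from zero, so the smallest attainable value is set by the number of draws.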

Examining the Human Gene Mutation Database we found that the 
107 TADA genes included 21 candidate genes for intellectual disability,
3 for epilepsy, 17 for schizophrenia, 9 for congenital heart disease
and 6 for metabolic disorders (Fig. 5). 

Conclusions 

Complementing earlier reports, ASD subjects show a clear excess of 
de novo LoF mutations above expectation, with a concentration of such 
events in a handful of genes. While this handful has a large effect on risk, 
most ASD genes have a much smaller impact. This gradient emerges 
most notably from the contrast of risk variation in male and female ASD 
subjects. Unlike some earlier studies, but consistent with expectation, 
the data also show clear evidence for effect of de novo missense SNVs 

Figure 4 | Transcription regulation
network of TADA genes. Edges 
indicate transcription regulators 
(source nodes) and their gene targets 
(target nodes) based on the ChEA 
network; interactions among only 
HMGs are ignored. 

on risk; for risk generated by LoF variants transmitted from unaffected
parents; and for the value of case-control design in gene discovery. By
integrating data on de novo, inherited and case-control variation, the
yield of ASD gene discoveries was doubled over what would be obtained
from a count of de novo LoF variants alone. ASD genes almost
uniformly show strong constraints against variation, a feature we exploit
to implicate other genes in risk.

Three critical pathways for typical development are damaged by 
risk variation: chromatin remodelling, transcription and splicing, and 
synaptic function. Chromatin remodelling controls events underlying 

Figure 5 | Involvement in disease of ASD genes. The Venn diagram shows
the overlap in disease involvement for the TADA genes. 



the formation of neural connections, including neurogenesis and neural
differentiation, and relies on epigenetic marks as post-translational
modifications of histones. Here we provide extensive evidence for HMGs
and readers in sporadic ASD, implicating specifically lysine methylation
and extending the mutational landscape of the emergent ASD gene
CHD8 to missense variants. Splicing is implicated by the enrichment of
RBFOX targets in the top ASD candidates. Risk variation also affects
multiple classes and components of synaptic networks, from receptors
and ion channels to scaffolding proteins. Because a wide set of synaptic
genes is disrupted in idiopathic ASD, it seems reasonable to suggest
that altered chromatin dynamics and transcription, induced by disruption
of relevant genes, lead to impaired synaptic function as well. De novo
mutations in ASD, intellectual disability and schizophrenia cluster
to synaptic genes, and synaptic defects have been reported in models
of these disorders. Integrity of synaptic function is essential for neural
physiology, and its perturbation could represent the intersection between
diverse neuropsychiatric disorders.

Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 

Extended Data Figure 1 flowchart

16 sample sets: 3,976 ASD subjects (2,303 trios); 6,059 unrelated controls
Sequenced on Illumina and SOLiD

SNV and indel branch:
  Called SNV and indel
  Cleaned to 3,871 ASD subjects
  De novo obtained in 2,270 trios; transmission called in 1,298 trios;
    variants in 1,601 cases and 5,397 controls
  Filtered transmission and case-control calls to MAF < 0.001
  Tallied variant counts
  Filtered highly mutated genes
  TADA analysis
  ASD risk genes: 33 with q < 0.1; 107 with q < 0.3
  Downstream analyses

CNV branch:
  Called CNV in available BAMs: 2,305 ASD subjects (1,456 trios);
    363 unrelated controls
  Cleaned to 2,244 ASD subjects
  Filtered to MAF < 0.001
  Overlapped with ASD risk genes


Extended Data Figure 1 | Workflow of the study. The workflow began with
16 sample sets, as listed in Supplementary Table 1. DNA was obtained, and
exomes were captured and sequenced. After variant calling, quality control was
performed: duplicate subjects and incomplete families were removed and
subjects with extreme genotyping, de novo, or variant rates were removed.
Following cleaning, 3,871 subjects with ASD remained. Analysis proceeded
separately for SNVs and indels, and CNVs. De novo and
transmission/non-transmission variants were obtained for trio data (published
de novo variants from 825 trios were incorporated). This led to the TADA
analysis, which found 33 ASD risk genes with an FDR < 0.1, and 107 with an
FDR < 0.3. CNVs were called in 2,305 ASD subjects. BAM, binary
alignment/map; MAF, minor allele frequency.

Extended Data Figure 2 | Expected number of ASD genes discovered as a
function of sample size. The multiple LoF test (red) is a restricted version of
TADA that uses only the de novo LoF data. TADA (blue) models de novo LoF,
de novo Mis3, LoF variants transmitted/not transmitted and LoF variants
observed in case-control samples. The sample size (n) indicates either n trios for
which we record de novo and transmitted variation (TADA), or n trios for
which we record only de novo events (multiple LoF), plus n cases and n controls.

Extended Data Figure 3 | Heat map of the numbers of variants used in
TADA analysis from each data set in genes with an FDR < 0.3. Left, variants
in affected subjects; right, unaffected subjects. For the counts, we only included
de novo LoF and Mis3 variants, transmitted/untransmitted and case-control
LoF variants. These variant counts are normalized by the length of coding
regions of each gene and sample size of each data set (|trio| + |case| for the left,
|trio| + |control| for the right). Description of the samples can be found in
Supplementary Table 1.

Extended Data Figure 4 | Genome browser view of the CNV deletions
identified in ASD-affected subjects. The deletions are displayed in red if of
unknown inheritance, in grey if inherited, and in black in unaffected subjects.
Deletions in parents are not shown. For deletions within a single gene, all
splicing isoforms are shown.

Extended Data Figure 5 | Frequency of variants by gender. Frequency of
de novo (dn) and transmitted (Tr) variants per sample in males (black) and
females (white) for genes with an FDR < 0.1 (top row), FDR < 0.3 (middle
row), or all TADA genes (bottom row). The P values were determined by
one-tailed permutation tests (*P < 0.05; **P < 0.01; ***P < 0.01).

Extended Data Figure 6 | Enrichment terms for the four clusters identified
by protein-protein interaction networks. P values calculated using
mouse-genome-informatics-mammalian-phenotype (MGI_Mammalian phenotype,
blue), Kyoto encyclopaedia of genes and genomes (KEGG) pathways (red), and
gene ontology biological processes (yellow) are indicated.

Extended Data Figure 7 | De novo variants in SET lysine methyltransferases
and jumonji lysine demethylases. Mis3 variants are in black, LoF in red, and
variants identified in other disorders in grey (Fig. 5). ARID, AT-rich interacting
domain; AWS, associated with SET domain; BAH, bromo adjacent homology;
bromo, bromodomain; FYR C, FY-rich C-terminal domain; FYR N, FY-rich
N-terminal domain; HMG, high mobility group box; JmjC, jumonji C
domain; JmjN, jumonji N domain; PHD, plant homeodomain; PWWP,
Pro-Trp-Trp-Pro domain; SET, Su(var)3-9, enhancer-of-zeste, trithorax domain.



©2014 Macmillan Publishers Limited. All rights reserved 




Extended Data Figure 8 | Transcription regulation network of TADA genes only. Edges indicate transcription regulators (source nodes) and their gene targets
(target nodes) based on the ChEA network. 

Extended Data Table 1 | CNVs hitting TADA genes

0.1 < q-value < 0.3: Evidence against role in ASD

Count of deletion CNVs inferred from sequence for ASD subjects and those unaffected by ASD. Number of subjects and family status: 849 ASD subjects without family information; 1,467 ASD subjects in families;
2,766 unaffected parents; 319 unaffected siblings of ASD subjects; 373 unaffected subjects without family information. NT, parent a carrier but CNV not transmitted to affected child; Tr-ASD, transmitted to ASD
subject from carrier parent; Tr-not-ASD, parent transmits a CNV to an unaffected child.

* No parents in this count were affected; seven parents in the study were affected, none carried a CNV reported in the table and these subjects did not enter the calculation.

† To compute the odds ratio we count the number of affected carriers (a), unaffected carriers (including parents) (b), affected subjects who do not have the CNV (c), and unaffected non-carriers (d). The odds
ratio = (ad)/(bc).

‡ One parent transmits the CNV to an affected and unaffected offspring; to obtain the total count of controls with a CNV, subtract one.

§ Genes are adjacent in the genome (see Extended Data Fig. 4). For three subjects both genes are affected by the same CNV (1 ASD and 2 unaffected subjects).